knitr::opts_chunk$set(echo = TRUE)

Introduction

During this assignment we will examine pollution and weather data in Madrid. The initial aim is to analyse the relationships amongst the variables and produce descriptive and graphical analyses. Then we will run a multiple linear regression of Nitrogen Dioxide (NO2) levels in the city against explanatory variables such as weather data and other city pollutants such as SO2, Ozone and Particulates 2.5.

First we loaded the necessary libraries.

library(data.table)  
library(dplyr)
library(tidyr)
library(lubridate)
library(ggplot2)
library(corrplot)
library(gridExtra)
library(MASS)
library(ggthemes)

The pollution data came as 72 separate csv files, so we initially imported these into a single list via an lapply call. We made sure the information on the month and year was retained from the csv file titles.

The weather data was relatively clean already, so this dataset needed very little transformation.

setwd("/Users/charlotteleysen/Google Drive/*PROJECTS/IE/Term 1/R/Workgroup/workgroup data")
temp <- list.files(path = "/Users/charlotteleysen/Google Drive/*PROJECTS/IE/Term 1/R/Workgroup/workgroup data", pattern = "\\.csv$")
myfiles <- lapply(temp, fread, sep=",")
myfiles <- setNames(myfiles, temp)
weather <- fread(file = "/Users/charlotteleysen/Google Drive/*PROJECTS/IE/Term 1/R/Workgroup/weather.csv", sep=",")

class(myfiles)
## [1] "list"
class(weather)
## [1] "data.table" "data.frame"

Next we needed to convert the list of monthly pollution data into one table. As the table structure of each month within the list was identical it was possible to simply combine the rows.

df <- bind_rows(myfiles, .id = "title")
head(df)
##                   title day hour  station parameter value
## 1: hourly_data_11_1.csv   1    1 28079004         1     6
## 2: hourly_data_11_1.csv   1    1 28079008         1    12
## 3: hourly_data_11_1.csv   1    1 28079017         1    12
## 4: hourly_data_11_1.csv   1    1 28079018         1    10
## 5: hourly_data_11_1.csv   1    1 28079024         1     7
## 6: hourly_data_11_1.csv   1    1 28079035         1    11
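For reference, the same list-to-table step can be mimicked in base R without dplyr; a minimal sketch on a toy named list (the names stand in for the csv titles):

```r
# Toy stand-in for "myfiles": a small named list of identically structured tables
lst <- list(
  "hourly_data_11_1.csv" = data.frame(day = 1, hour = 1, value = 6),
  "hourly_data_11_2.csv" = data.frame(day = 1, hour = 2, value = 9)
)

# Base-R equivalent of bind_rows(lst, .id = "title"): prepend each element's
# name as a "title" column, then stack all the rows
combined <- do.call(rbind, Map(function(d, nm) cbind(title = nm, d),
                               lst, names(lst)))
rownames(combined) <- NULL
combined
```

This works because every element of the list shares the same column structure, which is exactly the condition noted above for simply combining the rows.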

Now we have a single table of pollution data with 6,471,098 observations and 6 variables (title, day, hour, station, parameter, value), and a single table of weather data with 2192 observations and 7 variables (date, temp_avg, temp_max, temp_min, precipitation, humidity and wind_avg_speed).

Cleaning The Data

Dates

The pollution dataset (df) will need the most cleaning work. Currently the “title” variable gives noisy information about the month and year of the observation, while the day is held in a separate column. In the code below we clean up these variables and create a single “Date” column in dd/mm/yy format.

We also made sure the date variable in the weather dataset had the same format as that of the pollution data.

df$title <- gsub(pattern = "hourly_data_|\\.csv", replacement = "", x = df$title)
df <- separate(df, title, c("year","month"), sep = "_", remove = T)
df <- unite(df, "Date", c("day","month","year"), sep = "/")
df$Date <- as.Date(df$Date, format = "%d/%m/%y")
weather$date <- as.Date(weather$date, format = "%d/%m/%Y")

head(df)
##          Date hour  station parameter value
## 1: 2011-01-01    1 28079004         1     6
## 2: 2011-01-01    1 28079008         1    12
## 3: 2011-01-01    1 28079017         1    12
## 4: 2011-01-01    1 28079018         1    10
## 5: 2011-01-01    1 28079024         1     7
## 6: 2011-01-01    1 28079035         1    11
head(weather)
##          date temp_avg temp_max temp_min precipitation humidity
## 1: 2011-01-01      8.3     13.0      3.0          0.00       84
## 2: 2011-01-02      8.6     13.0      4.0          0.00       81
## 3: 2011-01-03      4.2      9.4     -1.6          0.00       86
## 4: 2011-01-04      6.5      8.0      4.1          0.00       93
## 5: 2011-01-05      8.9     10.0      6.3          0.00       90
## 6: 2011-01-06     12.2     15.0      8.9          0.51       87
##    wind_avg_speed
## 1:            5.2
## 2:            5.4
## 3:            3.5
## 4:            6.3
## 5:           10.4
## 6:           15.7

Missing Values

We did a brief check for missing values below.

colSums(is.na(df))
##      Date      hour   station parameter     value 
##         0         0         0         0       335
colSums(is.na(weather))
##           date       temp_avg       temp_max       temp_min  precipitation 
##              0              0              0              0              0 
##       humidity wind_avg_speed 
##              0              0

We found 335 NA values in the “value” column in the pollution dataset (df). So we examined which parameters and stations these NAs related to.

missing_stations <- df[!complete.cases(df),]$station
data.frame(table(missing_stations))
##    missing_stations Freq
## 1          28079004    3
## 2          28079008   17
## 3          28079011    1
## 4          28079018    7
## 5          28079024  170
## 6          28079027    1
## 7          28079038   25
## 8          28079040    1
## 9          28079047    4
## 10         28079048    6
## 11         28079049    4
## 12         28079050    2
## 13         28079055   93
## 14         28079060    1
missing_parameters <- df[!complete.cases(df),]$parameter
data.frame(table(missing_parameters))
##   missing_parameters Freq
## 1                  1    2
## 2                  7   18
## 3                  9   21
## 4                 10   10
## 5                 30   19
## 6                 35    7
## 7                 44  258

The main station that contained missing values was Station 28079024 with 170 NAs. The main parameter that contained missing values was 44 (NMHC / Non Methane Hydrocarbons) with 258 NAs.

This problem is not too detrimental to the analysis, because when we aggregate the pollution data into daily averages, missing values are dropped during that step.

Data Aggregation

Now it is time to aggregate the hourly data to give a daily average for each pollution parameter.
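Because the aggregation uses `na.rm = TRUE`, missing hourly values are simply dropped before averaging; a toy illustration:

```r
# One day's hourly readings for a single parameter, with one missing value
hourly <- c(6, 12, NA, 10)

mean(hourly)                # NA: a single missing value poisons the plain mean
mean(hourly, na.rm = TRUE)  # drops the NA first: (6 + 12 + 10) / 3
```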

daily_df <- df[, list(mean_value = mean(value, na.rm = T)), by = c('Date','parameter')][order(Date)]

Transformation

In order to visualise and compare the observations for each pollutant, we transform the parameter values into columns, so that each observation relates to a single unique day and gives the daily average value of each pollutant.

Finally we merge the pollution data with the weather data on their common dates to create a holistic final dataset.

spread_df <- spread(daily_df, parameter, mean_value)
merged_df <- merge(spread_df, weather, by.x = "Date", by.y = "date")

Now we have a complete and clean dataframe that we can start to work with.

Predicting NO2 with Linear Regression

Our initial analysis will focus on four pollution variables (Nitrogen_Dioxide, Sulphur_Dioxide, Ozone, Particulates_2.5) and the weather data. Therefore we will remove all the irrelevant pollutants from our dataset and label our pollutant variables more descriptively.

main_df <- merged_df[ , -c("6","7","10","20","30","35","42","44")]
names(main_df)[2:5] <- c("Sulphur_Dioxide","Nitrogen_Dioxide","Particulates_2.5","Ozone")
str(main_df)
## Classes 'data.table' and 'data.frame':   2192 obs. of  11 variables:
##  $ Date            : Date, format: "2011-01-01" "2011-01-02" ...
##  $ Sulphur_Dioxide : num  10.71 11.93 11.91 8.84 9.51 ...
##  $ Nitrogen_Dioxide: num  41.5 48.5 63.6 46.3 51.5 ...
##  $ Particulates_2.5: num  9.36 9.08 11.94 9.4 10.51 ...
##  $ Ozone           : num  20.47 15.56 9.45 13.34 10.88 ...
##  $ temp_avg        : num  8.3 8.6 4.2 6.5 8.9 12.2 10.9 9.8 8.4 6.9 ...
##  $ temp_max        : num  13 13 9.4 8 10 15 13 13 11 8.3 ...
##  $ temp_min        : num  3 4 -1.6 4.1 6.3 8.9 8.7 7.6 6.6 5.2 ...
##  $ precipitation   : num  0 0 0 0 0 ...
##  $ humidity        : int  84 81 86 93 90 87 81 88 92 91 ...
##  $ wind_avg_speed  : num  5.2 5.4 3.5 6.3 10.4 15.7 15.6 14.3 6.5 7.4 ...
##  - attr(*, "sorted")= chr "Date"
##  - attr(*, ".internal.selfref")=<externalptr>

Data Visualisation

Now we start to explore the data visually.

Descriptive Summary

summary(main_df)
##       Date            Sulphur_Dioxide  Nitrogen_Dioxide  Particulates_2.5
##  Min.   :2011-01-01   Min.   : 1.679   Min.   :  7.819   Min.   : 2.625  
##  1st Qu.:2012-07-01   1st Qu.: 3.757   1st Qu.: 26.170   1st Qu.: 7.306  
##  Median :2013-12-31   Median : 5.483   Median : 35.646   Median :10.024  
##  Mean   :2013-12-31   Mean   : 5.889   Mean   : 38.828   Mean   :11.112  
##  3rd Qu.:2015-07-02   3rd Qu.: 7.388   3rd Qu.: 48.122   3rd Qu.:13.646  
##  Max.   :2016-12-31   Max.   :17.129   Max.   :105.077   Max.   :58.556  
##      Ozone            temp_avg        temp_max        temp_min     
##  Min.   :  3.979   Min.   :-0.50   Min.   : 3.30   Min.   :-7.000  
##  1st Qu.: 30.088   1st Qu.: 8.40   1st Qu.:14.00   1st Qu.: 3.000  
##  Median : 49.555   Median :14.60   Median :21.10   Median : 8.900  
##  Mean   : 47.866   Mean   :15.45   Mean   :22.04   Mean   : 8.755  
##  3rd Qu.: 65.058   3rd Qu.:22.70   3rd Qu.:30.10   3rd Qu.:14.600  
##  Max.   :110.287   Max.   :32.30   Max.   :42.00   Max.   :25.800  
##  precipitation        humidity     wind_avg_speed 
##  Min.   : 0.0000   Min.   :15.00   Min.   : 2.00  
##  1st Qu.: 0.0000   1st Qu.:36.00   1st Qu.: 6.30  
##  Median : 0.0000   Median :55.00   Median : 8.70  
##  Mean   : 0.9173   Mean   :55.04   Mean   :10.15  
##  3rd Qu.: 0.0000   3rd Qu.:73.00   3rd Qu.:12.60  
##  Max.   :48.0100   Max.   :99.00   Max.   :35.60

This brief statistical overview of the dataset gives a few insights into the data. First, all four pollutants have high maximums compared to their inter-quartile ranges. This suggests that the distributions could be skewed to the right and that any potential outliers would be at the top end of the ranges. Qualitatively, the pollutants seem to fall at the low end of the spectrum during most days of the year, but occasionally, due to some external factors, the recorded levels are much higher.

Looking at the temperature variables, the maximum temperature recorded in Madrid is 42 degrees, while the lowest is -7. Precipitation is highly skewed to the right, with most data points centered around zero. Humidity and average wind speed, on the other hand, are relatively more normally distributed.
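The right-skew reading can be checked numerically: for a right-skewed variable the mean exceeds the median, and the maximum sits many IQRs above Q3. A toy demonstration on simulated data rather than the actual pollutant columns:

```r
set.seed(1)
x <- rexp(2000, rate = 1/10)  # simulated right-skewed variable

q <- quantile(x, c(0.25, 0.50, 0.75))

mean(x) > median(x)              # TRUE for a right-skewed distribution
(max(x) - q[3]) / (q[3] - q[1])  # maximum far above Q3, in IQR units
```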

Calendar Heatmap

Below we show a heatmap of Nitrogen Dioxide level over the time period of the data. It appears that NO2 levels are highest during the winter months.

cal_df <- main_df
cal_df$year<-as.numeric(as.POSIXlt(cal_df$Date)$year+1900)
cal_df$month<-as.numeric(as.POSIXlt(cal_df$Date)$mon+1)
cal_df$day<-as.numeric(as.POSIXlt(cal_df$Date)$mday)  # mday is already 1-based, unlike mon
cal_df$month_fac <-factor(cal_df$month,levels=as.character(1:12),labels=c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"),ordered=TRUE)
cal_df$weekday <- as.POSIXlt(cal_df$Date)$wday + 1  # wday runs 0 (Sun) to 6 (Sat), so weekday 1 = Sun
cal_df$weekday_fac<-factor(cal_df$weekday,levels=rev(1:7),labels=rev(c("Sun","Mon","Tue","Wed","Thu","Fri","Sat")),ordered=TRUE)
cal_df$monthweek <- ceiling((cal_df$day+(7-cal_df$weekday))/7)

ggplot(cal_df, aes(monthweek, weekday_fac, fill = Nitrogen_Dioxide)) + 
  geom_tile(colour = "white") + facet_grid(year~month_fac) + scale_fill_gradient(low="blue", high="red") +
  ggtitle("Heatmap of Nitrogen Dioxide levels") +  xlab("\n\nWeek of Month") + ylab("")

Scatterplots of each variable over time

It is important to have an idea of how each variable behaves over time. At the same time we can visually check for any apparent outliers. As a general check, we took thresholds of five standard deviations above and below the mean.

upper_threshold <- sapply(main_df[,-1], mean) + 5*sapply(main_df[,-1], sd)
lower_threshold <- sapply(main_df[,-1], mean) - 5*sapply(main_df[,-1], sd)

grid.arrange(
  ggplot(main_df, aes(x = Date, y = Sulphur_Dioxide)) +  geom_point(size = 0.7),
  ggplot(main_df, aes(x = Date, y = Nitrogen_Dioxide)) +  geom_point(size = 0.7) + geom_line(aes(y = 50), color = "red"),
  ggplot(main_df, aes(x = Date, y = Particulates_2.5)) +  geom_point(size = 0.7) + geom_line(aes(y = upper_threshold["Particulates_2.5"]), color = "blue") + geom_line(aes(y=25),color = "red"),
  ggplot(main_df, aes(x = Date, y = Ozone)) +  geom_point(size = 0.7),
  ggplot(main_df, aes(x = Date, y = temp_avg)) +  geom_point(size = 0.7),
  ggplot(main_df, aes(x = Date, y = temp_max)) +  geom_point(size = 0.7),
  ggplot(main_df, aes(x = Date, y = temp_min)) +  geom_point(size = 0.7),
  ggplot(main_df, aes(x = Date, y = precipitation)) +  geom_point(size = 0.7) + geom_line(aes(y = upper_threshold["precipitation"]),color = "blue"),
  ggplot(main_df, aes(x = Date, y = humidity)) +  geom_point(size = 0.7),
  ggplot(main_df, aes(x = Date, y = wind_avg_speed)) +  geom_point(size = 0.7) + geom_line(aes(y = upper_threshold["wind_avg_speed"]),color = "blue"),
  ncol = 3, widths = list(7,7,7), heights = list(20, 20, 20, 20)
)

These scatterplots show that both pollution and weather data are very seasonal and well-defined. The data generally lie within a specific yearly range.

Particulates_2.5 has 3 outliers. Precipitation has a few more outliers (more than 5 standard deviations above the mean), but they are still well within the theoretical range of precipitation, so they should not be discarded as observations. Most precipitation values are “0”, which is sensible given that precipitation in Madrid is notably infrequent.

The red line shows where pollutants have crossed their daily threshold values according to the city's air quality plan. There are 46 data points that cross the P_2.5 threshold, while 494 data points cross the NO2 daily threshold of 50. This clearly indicates that Madrid has a problem keeping NO2 levels low. Source: http://www.madrid.es/UnidadesDescentralizadas/AreasUrbanas_EducacionAmbiental/Catalogo/AirQualityPlan2011-15.pdf
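The exceedance counts quoted above boil down to a logical sum; a sketch on toy values rather than the real `main_df` columns:

```r
# Toy daily NO2 averages; the plan's daily NO2 threshold is 50
no2_daily <- c(41.5, 48.5, 63.6, 46.3, 51.5, 55.0)

sum(no2_daily > 50)   # number of days exceeding the threshold: 3
mean(no2_daily > 50)  # share of days exceeding the threshold: 0.5
```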

Scatterplots of explanatory variables vs NO2

The following scatterplots analyse the relationship between the explanatory variables and NO2. For the multiple linear regression to be valid, the explanatory variables should have linear relationships with NO2.

grid.arrange(
  ggplot(main_df, aes(y = Nitrogen_Dioxide, x = Sulphur_Dioxide)) +  geom_point(size = 0.7),
  ggplot(main_df, aes(y = Nitrogen_Dioxide, x = Ozone)) +  geom_point(size = 0.7),
  ggplot(main_df, aes(y = Nitrogen_Dioxide, x = Particulates_2.5)) +  geom_point(size = 0.7),
  ggplot(main_df, aes(y = Nitrogen_Dioxide, x = temp_avg)) +  geom_point(size = 0.7),
  ggplot(main_df, aes(y = Nitrogen_Dioxide, x = temp_max)) +  geom_point(size = 0.7),
  ggplot(main_df, aes(y = Nitrogen_Dioxide, x = temp_min)) +  geom_point(size = 0.7),
  ggplot(main_df, aes(y = Nitrogen_Dioxide, x = precipitation)) +  geom_point(size = 0.7),
  ggplot(main_df, aes(y = Nitrogen_Dioxide, x = humidity)) +  geom_point(size = 0.7),
  ggplot(main_df, aes(y = Nitrogen_Dioxide, x = wind_avg_speed)) +  geom_point(size = 0.7),
  ncol = 3, widths = list(7,7,7), heights = list(20, 20, 20)

)

Examining the plots above, the pollutant explanatory variables have clear linear relationships with NO2. SO2 and P2.5 are positively correlated with NO2, while Ozone is negatively correlated. Thus it appears that as the levels of NO2, SO2 and P2.5 increase, the level of Ozone decreases.

The temperature variables are all relatively similar and demonstrate a weaker negative relationship with NO2. Thus as temperature rises, the level of NO2 falls.

Precipitation and humidity have seemingly almost no correlation with NO2 as the data appear to be quite randomly distributed.

Finally, average wind speed is negatively correlated with NO2, so as wind speed increases, the level of NO2 decreases. However the rate of decrease in NO2 seems to taper off at higher wind speeds, which suggests some non-linearity.

Correlations

Here we build a correlation matrix to show the correlations amongst the variables.

corr_df <- main_df[ ,2:length(main_df)] 
corrMat <- cor(corr_df)
corrplot(corrMat, method = "number", order = "FPC", type = "lower", tl.cex = 0.7,number.cex = 0.7, cl.cex = 0.7 )

The highest correlations with NO2 appear to be:
- Ozone (-0.71)
- SO2 (+0.67)
- P2.5 (+0.64)
- Wind Speed (-0.60)

This is in line with our analysis of the scatter plots above.

Unsurprisingly, it appears that the temperature variables are highly correlated with each other. In addition, Ozone and Humidity are also highly correlated with each other and with the temperature variables. According to background information on Ozone levels, it is common to see higher levels of Ozone during warmer periods and lower levels during colder periods. Humidity, inversely, is lower during periods of high temperature.

These notable correlations among the explanatory variables are potential signs of multicollinearity in a linear regression model, so we will need to be cautious of this later on.
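A standard way to quantify this concern is the variance inflation factor, VIF_j = 1/(1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors. A minimal base-R sketch on simulated data (a packaged version exists as `car::vif`):

```r
set.seed(42)
n     <- 200
t_max <- rnorm(n)
t_min <- 0.9 * t_max + rnorm(n, sd = 0.3)  # deliberately collinear with t_max
wind  <- rnorm(n)                          # independent of the others
X <- data.frame(t_max, t_min, wind)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the remaining predictors
vif <- sapply(names(X), function(v) {
  f <- reformulate(setdiff(names(X), v), response = v)
  1 / (1 - summary(lm(f, data = X))$r.squared)
})
round(vif, 2)  # the collinear pair shows inflated values; wind stays near 1
```

Values well above 1 flag predictors whose information is largely duplicated elsewhere, which is exactly what we would expect for the three temperature variables here.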

Implementing the Model

Now we are ready to implement the model. We ran a linear regression of NO2 against SO2, Ozone, P_2.5 and the weather variables.

lm1 <- lm(Nitrogen_Dioxide ~ . -Date, data = data.frame(main_df))
summary(lm1)
## 
## Call:
## lm(formula = Nitrogen_Dioxide ~ . - Date, data = data.frame(main_df))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.481  -4.493   0.050   4.339  24.098 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      44.38603    2.26953  19.557  < 2e-16 ***
## Sulphur_Dioxide   1.53615    0.07570  20.291  < 2e-16 ***
## Particulates_2.5  0.89908    0.03579  25.124  < 2e-16 ***
## Ozone            -0.37498    0.01162 -32.279  < 2e-16 ***
## temp_avg         -0.24635    0.22108  -1.114    0.265    
## temp_max          0.64940    0.12552   5.174 2.50e-07 ***
## temp_min         -0.68495    0.13171  -5.201 2.17e-07 ***
## precipitation     0.20618    0.04976   4.143 3.55e-05 ***
## humidity         -0.11855    0.01796  -6.602 5.08e-11 ***
## wind_avg_speed   -0.47516    0.03781 -12.565  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.908 on 2182 degrees of freedom
## Multiple R-squared:  0.8383, Adjusted R-squared:  0.8376 
## F-statistic:  1257 on 9 and 2182 DF,  p-value: < 2.2e-16

The results show that all variables are statistically significant to the highest degree, except temp_avg, which has a p-value of 0.265. This is due to the high correlation between the three temperature variables, and suggests the linear model would be better if we excluded this variable.

SO2 and P2.5 have positive coefficients while Ozone has a negative coefficient. Max and min temperatures have opposing coefficients (+0.65 and -0.68). Precipitation is positively correlated with NO2. Humidity and average wind speed are both negatively correlated with NO2.

In order to analyse the validity of the model, we need to examine the distribution of the residuals. For this we plot the residuals in a histogram and a boxplot. The Q-Q plot is created to analyse any skew in the distribution of the residuals.

resid <- residuals(lm1)
par(mfrow = c(1,3))
hist(x = resid, breaks = 30, main = "Histogram of Residuals"); grid()
boxplot(resid, main = "Boxplot of Residuals"); grid()
qqnorm(resid, main = "Q-Q plot of Residuals"); grid()

par(mfrow = c(1,1))

The above plots suggest the model is indeed valid. The histogram and boxplot show the residuals distributed normally around 0. The Q-Q plot is also a clear straight line, with a few outlier observations away from the normal line.
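The visual read can be complemented with a formal normality test; `shapiro.test` (base R, valid for samples of up to 5000 observations, so it would cover our 2192 residuals) is one option. A sketch on a toy regression rather than the fitted `lm1`:

```r
set.seed(7)
x   <- rnorm(300)
y   <- 2 + 3 * x + rnorm(300)  # toy model with genuinely normal errors
fit <- lm(y ~ x)

# Shapiro-Wilk: the null hypothesis is that the sample is normal, so only
# a small p-value would be evidence against normality
shapiro.test(residuals(fit))
```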

In order to detect further non-linearity and unequal error variances, we examine a plot of fitted values versus the residuals.

temp_df <- data.table(cbind(fitted = lm1$fitted.values, residuals = resid))
ggplot(temp_df,  aes(x = fitted, y = residuals)) + geom_point(size = 0.7)

cor(x = temp_df$fitted, y = temp_df$residuals)
## [1] -3.048858e-17

The correlation statistic and the plot show no meaningful correlation.

Standardised Linear Model

The explanatory variables do not all have the same units, so it may be worthwhile standardising the variables before fitting the regression model. This will also help give an order of importance to the variables and address some of the multicollinearity issues we saw earlier in the scatterplots.

Before we perform this next model, we will take out the temp_avg variable.

scaled_df <- data.table(sapply(main_df[,-c('Date','temp_avg')], scale))
head(scaled_df)
##    Sulphur_Dioxide Nitrogen_Dioxide Particulates_2.5     Ozone   temp_max
## 1:        1.804316        0.1564815       -0.3269371 -1.197182 -1.0029587
## 2:        2.260953        0.5626796       -0.3806518 -1.411800 -1.0029587
## 3:        2.253024        1.4469783        0.1556679 -1.679097 -1.4023486
## 4:        1.104553        0.4355844       -0.3196178 -1.508833 -1.5576668
## 5:        1.353938        0.7402077       -0.1118426 -1.616272 -1.3357836
## 6:        1.026629       -0.2041447       -0.7728274 -1.068411 -0.7810755
##       temp_min precipitation humidity wind_avg_speed
## 1: -0.83367020    -0.2833164 1.387511    -0.92693701
## 2: -0.68880706    -0.2833164 1.243794    -0.88944769
## 3: -1.50004066    -0.2833164 1.483322    -1.24559619
## 4: -0.67432075    -0.2833164 1.818660    -0.72074577
## 5: -0.35562183    -0.2833164 1.674943     0.04778519
## 6:  0.02102234    -0.1258044 1.531227     1.04125205
lm2 <- lm(Nitrogen_Dioxide ~ ., data = data.frame(scaled_df))
summary(lm2)
## 
## Call:
## lm(formula = Nitrogen_Dioxide ~ ., data = data.frame(scaled_df))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5436 -0.2605  0.0033  0.2530  1.4007 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.827e-17  8.608e-03   0.000        1    
## Sulphur_Dioxide   2.424e-01  1.154e-02  21.008  < 2e-16 ***
## Particulates_2.5  2.809e-01  1.116e-02  25.183  < 2e-16 ***
## Ozone            -5.006e-01  1.551e-02 -32.283  < 2e-16 ***
## temp_max          2.783e-01  3.380e-02   8.233 3.10e-16 ***
## temp_min         -3.280e-01  2.480e-02 -13.228  < 2e-16 ***
## precipitation     3.896e-02  9.399e-03   4.146 3.52e-05 ***
## humidity         -1.390e-01  2.133e-02  -6.516 8.92e-11 ***
## wind_avg_speed   -1.481e-01  1.177e-02 -12.584  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.403 on 2183 degrees of freedom
## Multiple R-squared:  0.8382, Adjusted R-squared:  0.8376 
## F-statistic:  1413 on 8 and 2183 DF,  p-value: < 2.2e-16
coef_importance <- data.frame(coeff = sort(abs(lm2$coefficients), decreasing = T))
coef_importance$weather <- rownames(coef_importance)
ggplot(coef_importance, aes(x = reorder(weather, coeff), y = coeff)) + geom_bar(stat = 'identity') + coord_flip() + labs(x = "weather")


The linear model summary shows all variables are strongly significant with small p-values. The bar plot shows the variables sorted in order of importance for explaining NO2. The three most important variables appear to be Ozone, minimum temperature and Particulates 2.5.

Although the average temperature variable was removed in this model, the adjusted R-squared has stayed the same across both models.

Fitted vs Real Values

As a final check of our model we will plot the fitted values of the model versus the actual values.

ggplot(main_df, aes(x = lm1$fitted.values, y = Nitrogen_Dioxide)) + geom_point(size = 0.7)

There is a clear straight diagonal line with roughly the same spread across the whole range, which indicates an equal distribution of variance, i.e. homoskedasticity.
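Beyond the visual check, heteroskedasticity can be tested formally in the Breusch-Pagan style: regress the squared residuals on the fitted values and ask whether they explain anything. Below is a base-R sketch on simulated data with deliberately non-constant variance (the packaged equivalent is `lmtest::bptest`):

```r
set.seed(3)
x   <- runif(500, 1, 10)
y   <- 1 + 2 * x + rnorm(500, sd = x)  # error spread grows with x on purpose
fit <- lm(y ~ x)

# Breusch-Pagan idea: regress the squared residuals on the fitted values;
# under homoskedasticity, n * R^2 of this auxiliary fit is ~ chi-squared(1)
aux     <- lm(residuals(fit)^2 ~ fitted(fit))
bp_stat <- length(y) * summary(aux)$r.squared
p_val   <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
p_val  # small here, flagging the built-in heteroskedasticity
```

For a model like ours, a non-small p-value from this test would back up the visual impression of equal variance.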

Additional Analysis

In our additional analysis we decided to look at station specific data on a yearly basis. We wanted to find out if there was any pattern change over the years.

First we want to identify the locations of each station, so we import another dataset which links the station codes to their precise locations.

# station data
stations<-fread(input = '/Users/charlotteleysen/Google Drive/*PROJECTS/IE/Term 1/R/Workgroup/run_results.csv')
stations$Station_name<-sapply(X = stations$Station_name,gsub,pattern = 'ESTACIÓN: ',replacement = '')
stations$Station_code<-sapply(X = stations$Station_code,gsub,pattern = 'CÓDIGO: ',replacement = '')
stations$Station_code <- as.integer(stations$Station_code)

head(stations)
##        Station_name Station_code      Station_lon       Station_lat
## 1:  Plaza de España     28079004  3º 42' 44.40""W  40º 25' 26.37""N
## 2: Escuelas Aguirre     28079008 3º 40' 56,35'' W 40º 25' 17,63'' N
## 3:    Ramón y Cajal     28079011 3º 40' 38,47'' W 40º 27' 05,30'' N
## 4:     Arturo Soria     28079016 3º 38' 21,24'' W 40º 26' 24,17'' N
## 5:       Villaverde     28079017 3º 42' 47,98"" W 40º 20' 49,56"" N
## 6:        Farolillo     28079018 3º 43' 54,60'' W 40º 23' 41,20'' N

Next we want to manipulate our original dataset. Instead of aggregating the station data, we will aggregate the dates and look at the data from a year perspective.

df_new <- df
df_new$year <- format(df_new$Date, "%Y")
df_new <- df_new[, list(mean_value = mean(value, na.rm = T)), by = c('year','station','parameter')]
df_new <- spread(df_new, parameter, mean_value)

A check of the NAs in each column shows that parameters 7 and 8 have no missing values. Therefore, as this is the most complete data, we will focus the rest of our analysis on these two parameters, namely Nitric Oxide (NO) and Nitrogen Dioxide (NO2).

colSums(is.na(df_new))
##    year station       1       6       7       8       9      10      14 
##       0       0      84      84       0       0     108      72      60 
##      20      30      35      42      44 
##     108     108     108     121     121
df_new2 <- df_new[, c('year','station','7','8')]
names(df_new2)[c(3,4)]<-c('NO','NO2')

Now we need to label the station codes with real location names, by merging with the stations dataset.

df_new2_merged <- merge(df_new2,stations, by.x = "station", by.y = "Station_code")
df_new2_merged <- df_new2_merged[, c("year","Station_name","NO","NO2")]
colnames(df_new2_merged)[2] <- "station"

head(df_new2_merged)
##    year         station       NO      NO2
## 1: 2011 Plaza de España 48.16923 50.92531
## 2: 2012 Plaza de España 42.31437 46.30483
## 3: 2013 Plaza de España 36.52791 45.68268
## 4: 2014 Plaza de España 36.00515 37.94380
## 5: 2015 Plaza de España 45.38290 50.87986
## 6: 2016 Plaza de España 36.52716 45.63555

Let’s first look at aggregated NO data for 2016 for each station and order by NO level.

df_2016<-setorder(df_new2_merged[year==2016,],-NO)

df_2016$station<-as.factor(df_2016$station)
df_2016$station <- factor(df_2016$station, levels = df_2016$station[order(df_2016$NO)])

d<-ggplot(data = df_2016, aes(x = station, y = NO))
d + geom_point() + theme_economist() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  ggtitle('NO levels by station locations in 2016') + labs(x="Stations", y = "NO")


Unsurprisingly, we see that the areas with the highest density of green space have the lowest overall average NO levels, namely Casa de Campo and Retiro Park. El Prado, which is next to the park, has the lowest level. Plaza Espana and Plaza Fernandez Ladrada have the highest levels, around four times those of the lowest locations.

Now we examine each location’s NO change between 2011 and 2016.

plot_stations<-function(data,n_stations){
  stations1<-unique(data$station)
  plot_data<-data.frame(data[data$station == stations1[n_stations],])
  
  d<-ggplot(data = plot_data, aes(x = year, y = NO))
  d + geom_point() + theme_economist() + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    ggtitle(plot_data[1,2]) + theme(plot.title = element_text(size=7)) + labs(x=NULL, y=NULL) + ylim(0, 60)
}
grid.arrange(grobs = lapply(1:24, function(i) plot_stations(df_new2_merged, i)))


From the graphs above we notice what seems to be an anomaly in 2015. From 2011 to 2014 there is a clear downward trend in NO levels across all locations, but in 2015 we see a spike back up.

Let’s look at how the weather behaved during this period to shed some light on what happened.

yearly<-weather[,lapply(.SD,mean),by = year(date)]

data_plot<-data.frame(yearly)

dd<-ggplot(data = data_plot, aes(x = year, y = temp_avg))
plot1<-dd + geom_point() + theme_economist() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle('Average temperature') + theme(plot.title = element_text(size=15)) + labs(x ="Year", y = "Avg Temp")

dd<-ggplot(data = data_plot, aes(x = year, y = precipitation))
plot2<-dd + geom_point() + theme_economist() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle('Precipitation') + theme(plot.title = element_text(size=15)) + labs(x ="Year", y = "Precipitation")

dd<-ggplot(data = data_plot, aes(x = year, y = humidity))
plot3<-dd + geom_point() + theme_economist() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle('Humidity') + theme(plot.title = element_text(size=15)) + labs(x ="Year", y = "Humidity")

dd<-ggplot(data = data_plot, aes(x = year, y = wind_avg_speed))
plot4<-dd + geom_point() + theme_economist() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle('Wind speed') + theme(plot.title = element_text(size=15)) + labs(x ="Year", y = "Avg Wind Speed")

grid.arrange(
  plot1,
  plot2,
  plot3,
  plot4
)


Here we see that 2015 was hotter than normal, with little rain, little wind and a drop in humidity. In response to this increase in pollution, the Madrid administration introduced tougher measures in 2016 to restrict traffic. As a result, helped by more favourable weather conditions, we see a significant drop in pollution levels in Madrid in 2016.